## 'data.frame': 4898 obs. of 13 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 0.045 0.014 0.03 0.047 0.047 0.03 0.03 0.045 0.014 0.028 ...
## $ total.sulfur.dioxide: num 0.17 0.132 0.097 0.186 0.186 0.097 0.136 0.17 0.132 0.129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ used.sulfur.dioxide : num 0.125 0.118 0.067 0.139 0.139 0.067 0.106 0.125 0.118 0.101 ...
This data set on white wine quality has 4898 observations of 13 variables. Quality is the output observation and is the only integer in this dataset. The rest of the variables are decimal point numbers. I want to see if quality can be estimated based on any of these variables.
## [1] "Quality"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
## [1] "High Quality Wines"
## [1] 1060
## [1] "Low Quality Wines"
## [1] 1640
I grouped the wines rated 7 or greater as high quality and the wines rated 5 or lower as low quality. About 22% of the wines are high quality and about 33% are low quality. I’m hoping to find similarities within these groups and to compare them to each other and to the dataset as a whole. The histogram of white wine quality has a fairly normal distribution.
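The grouping rule and the shares quoted above can be sketched as follows. This is a minimal illustration, not necessarily how the groups were built in the original analysis; the function name `quality_group` and the "middle" label for wines rated 6 are assumptions made here.

```r
# Label a quality score using the thresholds described above:
# 7 or greater is "high", 5 or lower is "low".
# The "middle" label for a rating of 6 is a hypothetical name.
quality_group <- function(quality) {
  ifelse(quality >= 7, "high", ifelse(quality <= 5, "low", "middle"))
}

# Quality ranges from 3 to 9 in this dataset.
groups <- quality_group(3:9)

# Group shares in percent, using the counts printed above.
n_total <- 4898
n_high  <- 1060
n_low   <- 1640
share_high <- round(100 * n_high / n_total, 1)
share_low  <- round(100 * n_low / n_total, 1)
```

With the counts above, `share_high` comes out at 21.6 and `share_low` at 33.5.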
## [1] "Fixed Acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## [1] "Volatile Acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Both Fixed and Volatile Acidity have fairly normal distributions. Volatile Acidity has many more outliers, but both of these variables have small interquartile ranges. It’s interesting that the IQR makes up only about 10% of the range for both of these variables. This seems very low and suggests that a small number of outliers stretch the range well beyond where most of the data sits.
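The "IQR as a share of the range" figure can be checked directly from the printed summaries. A small sketch, where `iqr_share` is a hypothetical helper name and the inputs are copied from the output above:

```r
# Share of the full range covered by the interquartile range, in percent.
iqr_share <- function(lo, q1, q3, hi) {
  100 * (q3 - q1) / (hi - lo)
}

# Values copied from the summaries printed above.
fixed_share    <- iqr_share(lo = 3.8,  q1 = 6.3,  q3 = 7.3,  hi = 14.2)
volatile_share <- iqr_share(lo = 0.08, q1 = 0.21, q3 = 0.32, hi = 1.1)

shares <- round(c(fixed = fixed_share, volatile = volatile_share), 1)
```

This gives 9.6 for Fixed Acidity and 10.8 for Volatile Acidity. The same helper works for any of the other variables by plugging in their minimum, quartiles, and maximum.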
## [1] "Citric Acid"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
## [1] "pH"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Citric Acid and pH again have fairly normal distributions with quite a few outliers. This time, though, Citric Acid has an IQR that makes up only 7% of the range while pH is closer to 17%. I wonder if high quality wines have values that fall within these small IQRs.
## [1] "Sulphates"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
## [1] "Sulfur Dioxide"
## [1] "Free Sulfur Dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00200 0.02300 0.03400 0.03531 0.04600 0.28900
## [1] "Used Sulfur Dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0040 0.0780 0.1000 0.1031 0.1250 0.3310
## [1] "Total Sulfur Dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0090 0.1080 0.1340 0.1384 0.1670 0.4400
So far very little has stood out as abnormal in these variables. Sulphates and Sulfur Dioxide levels are no different, with normal distributions and a similar number of outliers compared to the other variables. Free Sulfur Dioxide is the only one of these variables with a small IQR, making up about 8% of the full range. The others have larger IQRs relative to their ranges.
## [1] "Chlorides"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## [1] "Residual Sugar"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
These two variables are very different. Chlorides has a fairly normal distribution with a bit of a positive skew. It has the smallest IQR relative to its range, at about 4%. Residual Sugar on the other hand is very positively skewed, with a large amount of data at low levels.
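Skew can also be gauged roughly from the summary statistics alone. The sketch below uses two simple indicators that are not part of the original analysis: the gap between mean and median, and Bowley's quartile skewness coefficient. The helper name `quartile_skew` is hypothetical, and the inputs are copied from the summaries above.

```r
# Bowley's quartile skewness: positive when the upper quartile sits
# further from the median than the lower quartile does.
quartile_skew <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Values copied from the summaries printed above.
sugar_skew     <- quartile_skew(q1 = 1.7,   med = 5.2,   q3 = 9.9)
chlorides_skew <- quartile_skew(q1 = 0.036, med = 0.043, q3 = 0.05)

# Mean minus median: a positive gap hints at a longer right tail.
sugar_mean_gap     <- 6.391 - 5.2
chlorides_mean_gap <- 0.04577 - 0.043
```

The quartile measure only looks at the middle half of the data, so it comes out positive for Residual Sugar but essentially zero for Chlorides. The Chlorides skew shows up only in the mean sitting above the median, which fits a distribution that is symmetric in the middle with a long tail of high outliers.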
## [1] "Density"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## [1] "Alcohol"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
These last two variables are again distributed quite normally, except that alcohol has a bit of a positive skew. Nothing else really stands out.
This dataset contains 4898 observations of 13 variables. All of the 12 input variables are decimal point numbers and the one output variable is an integer. Apart from residual sugar and fixed acidity, the measured concentrations (volatile acidity, citric acid, chlorides, sulphates, and the sulfur dioxide variables) are found in very small quantities, mostly less than 1 g / dm^3.
Quality ranges from 3 to 9 with a median of 6 and a mean of 5.878.
The main feature of interest is our output variable, quality. The goal is to determine which of the measured variables affect the overall quality. The attribute details state that the output is based on sensory data, which I assume means quality is based on the taste and smell of the wine (more so than hangover level or overall price value).
After doing a little research to determine what affects the taste of wine, it appears that acidity and sulfur dioxide levels may be good indicators. Acidity causes food and drinks to taste more zesty. Citric acid may add to the sweetness but fixed and volatile acidity will add more zest. During the fermentation process, maintaining appropriate oxygen levels is important as oxygen causes reactions in the other chemicals in the wine. This reaction causes the fruit in the wine to lose its aroma. Sulfur Dioxide is added as an antioxidant to help reduce reactions between molecules.
From what I saw above, Chlorides, Alcohol, and Sugar levels were the most different of all the variables. I think they would be good places to start.
I created a variable for the used sulfur dioxide by taking the total sulfur dioxide and subtracting the free sulfur dioxide in each wine. I did this because sulfur dioxide may be useful in determining quality, but I’m unsure whether available or used sulfur dioxide has more effect.
I also changed the values for the sulfur dioxide variables so that they were measured in the same units as the rest of the variables. And I removed the variable ‘X’ which was just an observation ID.
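These preparation steps could look roughly like the sketch below. It assumes the raw sulfur dioxide columns were in mg/dm^3 and were divided by 1000; the raw values shown are back-calculated from the first four rows of the structure output under that assumption, and the data frame names `raw` and `wine` are hypothetical.

```r
# First four wines only. Raw sulfur dioxide values are ASSUMED to be in
# mg/dm^3 (back-calculated from the converted values printed earlier).
raw <- data.frame(
  X                    = 1:4,
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

wine <- raw

# Drop the observation ID column.
wine$X <- NULL

# Convert both sulfur dioxide columns from mg/dm^3 to g/dm^3.
so2_cols <- c("free.sulfur.dioxide", "total.sulfur.dioxide")
wine[so2_cols] <- wine[so2_cols] / 1000

# Used sulfur dioxide is whatever is not free.
wine$used.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide
```

For these four rows the derived column is 0.125, 0.118, 0.067, and 0.139, which matches the `used.sulfur.dioxide` values in the structure output at the top.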
None of the distributions jumped out as unusual. There are, however, a lot of outliers in nearly every variable. The only change I made was to create datasets containing the best and worst quality wines. I think finding relationships between wines of similar quality will be more appropriate than trying to determine the same thing over an entire group full of outliers.
Here is the correlation plot for White Wine. The variable most correlated with quality appears to be alcohol, with a value of 0.44. This will be a good start for our analysis. Alcohol also appears to be at least partially correlated with sugar, chlorides, total sulfur dioxide, and density.
Density has two of the highest correlation values when compared to sugar and alcohol with values of 0.84 and -0.78 respectively. I think these two directly affect the density of wine and won’t be used towards the quality.
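For reference, the correlation values in this report are presumably Pearson coefficients, which is what R's `cor()` returns by default. The sketch below applies it to the first ten alcohol and residual sugar values from the structure output, and repeats the calculation by hand. Ten rows are far too few to reproduce the full-dataset figures, so this only illustrates the calculation.

```r
# First ten observations, copied from the structure output at the top.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
sugar   <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Pearson correlation (the default method of cor()).
r_sample <- cor(alcohol, sugar)

# The same quantity written out: covariance scaled by both spreads.
dev_a <- alcohol - mean(alcohol)
dev_s <- sugar - mean(sugar)
r_manual <- sum(dev_a * dev_s) / sqrt(sum(dev_a^2) * sum(dev_s^2))
```

Even in this small sample the sign is negative, in line with the negative relationship between sugar and alcohol seen in the full data.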
## [1] "Correlation between Quality and Alcohol"
## [1] 0.4355747
## [1] "alcohol"
## Low Quality All Wine High Quality
## Min " 8.00 " " 8.00 " " 8.50 "
## 1st Q " 9.20 " " 9.50 " "10.70 "
## Median " 9.60 " "10.40 " "11.50 "
## Mean " 9.85 " "10.51 " "11.42 "
## 3rd Q "10.40 " "11.40 " "12.40 "
## Max "13.60 " "14.20 " "14.20 "
The most correlated variable with quality is alcohol. I like using geom_jitter with an alpha value because it’s a little easier to understand visually. There is definitely an upward trend seen between quality and alcohol. High quality wines generally have higher alcohol contents when compared to the low quality wines.
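A three-column comparison like the alcohol table above can be assembled with `summary()` and `sapply()`. The sketch below runs on a small made-up data frame, not the wine data, and the names `toy` and `compare_groups` are hypothetical.

```r
# Made-up toy data for illustration only; NOT taken from the wine dataset.
toy <- data.frame(
  quality = c(4, 5, 5, 6, 6, 6, 7, 7, 8),
  alcohol = c(9.0, 9.4, 9.8, 10.0, 10.5, 11.0, 11.2, 12.0, 12.5)
)

# Same thresholds as in the report: 5 or lower, and 7 or greater.
low  <- toy[toy$quality <= 5, ]
high <- toy[toy$quality >= 7, ]

# Six-number summary of one variable for each group, one column per group.
compare_groups <- function(variable) {
  sapply(
    list("Low Quality" = low, "All Wine" = toy, "High Quality" = high),
    function(d) c(unclass(summary(d[[variable]])))
  )
}

alcohol_table <- compare_groups("alcohol")
```

The result is a numeric matrix with rows named after the summary statistics and one column per group, so a single cell can be read off with, for example, `alcohol_table["Median", "High Quality"]`.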
## [1] "Sugar"
## Low Quality All Wine High Quality
## Min " 0.600" " 0.600" " 0.800"
## 1st Q " 1.700" " 1.700" " 1.800"
## Median " 6.625" " 5.200" " 3.875"
## Mean " 7.054" " 6.391" " 5.262"
## 3rd Q "11.025" " 9.900" " 7.400"
## Max "23.500" "65.800" "19.250"
## [1] "Density"
## Low Quality All Wine High Quality
## Min "0.9872" "0.9871" "0.9871"
## 1st Q "0.9932" "0.9917" "0.9905"
## Median "0.9951" "0.9937" "0.9917"
## Mean "0.9952" "0.9940" "0.9924"
## 3rd Q "0.9971" "0.9961" "0.9936"
## Max "1.0024" "1.0390" "1.0006"
## [1] "Correlation between Density and Sugar"
## [1] 0.8389665
## [1] "Correlation between Density and Alcohol"
## [1] -0.7801376
The box plots clearly show the change of a few key measurements between high and low quality wine. High quality wines generally have a smaller IQR and range. This is similar to what was found before when looking at single variables.
Since density is so heavily correlated with alcohol and sugar, it won’t be used to determine quality. Sugar and alcohol are properties that are more likely to have a direct effect.
## [1] "Chlorides"
## Low Quality All Wine High Quality
## Min "0.0090" "0.0090" "0.0120"
## 1st Q "0.0400" "0.0360" "0.0310"
## Median "0.0470" "0.0430" "0.0370"
## Mean "0.0514" "0.0457" "0.0381"
## 3rd Q "0.0530" "0.0500" "0.0440"
## Max "0.3460" "0.3460" "0.1350"
## [1] "Correlation"
## [1] -0.3601887
## [1] "Total Sulfur Dioxide"
## Low Quality All Wine High Quality
## Min "0.0090" "0.0090" "0.0340"
## 1st Q "0.1170" "0.1080" "0.1010"
## Median "0.1490" "0.1340" "0.1220"
## Mean "0.1486" "0.1384" "0.1252"
## 3rd Q "0.1820" "0.1670" "0.1460"
## Max "0.4400" "0.4400" "0.2290"
## [1] "Correlation"
## [1] -0.4488921
## [1] "Correlation"
## [1] -0.4506312
Chlorides in high and low quality wine actually have the same sized IQR but low quality wine has many more outliers. A specific chloride range isn’t enough to determine quality but it definitely plays a role. We already saw that high quality wines tend to have a high alcohol content, making the correlation between chlorides and alcohol very helpful.
The same can be said for the total sulfur dioxide and sugar plots, but they have slightly less change between quality levels. Either way it appears that these variables are the most likely to affect the quality level.
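The IQR comparison between quality groups can be read straight off the tables above. A short sketch with the quartiles copied from the printed output; the data frame name `spread` is hypothetical.

```r
# First and third quartiles per group, copied from the tables above.
spread <- data.frame(
  variable = c("chlorides", "total.sulfur.dioxide", "residual.sugar"),
  low_q1   = c(0.0400, 0.1170, 1.700),
  low_q3   = c(0.0530, 0.1820, 11.025),
  high_q1  = c(0.0310, 0.1010, 1.800),
  high_q3  = c(0.0440, 0.1460, 7.400)
)

# Interquartile range within each quality group.
spread$low_iqr  <- spread$low_q3 - spread$low_q1
spread$high_iqr <- spread$high_q3 - spread$high_q1
```

Chlorides has an IQR of 0.013 in both groups, while total sulfur dioxide (0.065 vs 0.045) and residual sugar (9.325 vs 5.6) both have a narrower IQR in the high quality group.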
The biggest factor I’ve seen so far in determining quality level is alcohol. I didn’t expect wine quality to be most correlated with alcohol. Comparing alcohol against other variables is offering insights into secondary or tertiary variables that affect quality. Other than alcohol, the three most useful variables are chlorides, total sulfur dioxide, and sugar. Seeing the range in the box plots change between quality levels was very useful in selecting useful variables.
I checked whether density levels are driven more by other variables than by quality. Density is indeed very correlated with both sugar and alcohol. This was important to check because density was one of the variables more correlated with quality when looking at the whole dataset.
The relationship that looks most useful for separating quality levels is between alcohol and chlorides, followed by alcohol and total sulfur dioxide, even though across all wines the correlation with alcohol is weaker for chlorides (-0.36) than for total sulfur dioxide (-0.45). I think I’ll look closer at these variables to see what else I can determine.
## [1] "High Quality"
## [1] "Low Quality"
Looking at the high quality correlation plot, there are three major differences I see when comparing against alcohol. The correlations of Fixed Acidity, Volatile Acidity, and Chlorides with alcohol all become much stronger.
## [1] "Correlation between Chlorides and Alcohol in High Quality Wine"
## [1] -0.5436734
## [1] "Correlation between Chlorides and Alcohol in Low Quality Wine"
## [1] -0.2432591
Alcohol and Chlorides continue to show themselves as the strongest pair of factors in discerning quality level. The curve representing the high quality wine becomes more horizontal, strengthening the idea that there is a strict range for the variables in high quality wine.
## [1] "Total Sulfur Dioxide"
## [1] "Correlation of Total Sulfur Dioxide and Alcohol in High Quality Wine"
## [1] -0.4497517
## [1] "Correlation of Total Sulfur Dioxide and Alcohol in Low Quality Wine"
## [1] -0.4089919
## [1] "Residual Sugar"
## [1] "Correlation between Sugar and Alcohol in High Quality Wine"
## [1] -0.4839206
## [1] "Correlation between Sugar and Alcohol in Low Quality Wine"
## [1] -0.4335393
The plot for total sulfur dioxide vs alcohol shows the change between quality levels in a way similar to what we saw with chlorides. High quality wines once again have a smaller range for their observed values. For the plot with sugar we see that the curves representing quality are quite similar. I think this tells us more about the relationship between sugar and alcohol than it does about quality.
The relationship I’ve observed between alcohol and chloride content seems to be the strongest pairing of variables. Chloride levels in high quality wine tend to stay within a certain low range. Low chloride levels combined with a high alcohol content are generally seen in high quality wines.
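One way to see which pairing changes most between quality groups is to line up the printed correlations with alcohol and compare their strength. This is a small sketch on the figures above; the data frame name `alcohol_cor` and the `gap` column are names introduced here.

```r
# Correlations with alcohol within each quality group, copied from above.
alcohol_cor <- data.frame(
  variable = c("chlorides", "total.sulfur.dioxide", "residual.sugar"),
  high     = c(-0.5436734, -0.4497517, -0.4839206),
  low      = c(-0.2432591, -0.4089919, -0.4335393)
)

# How much stronger the relationship is in high quality wine.
alcohol_cor$gap <- abs(alcohol_cor$high) - abs(alcohol_cor$low)

# Largest change first.
ranked <- alcohol_cor[order(-alcohol_cor$gap), ]
```

Chlorides shows by far the largest change (about 0.30), while total sulfur dioxide and residual sugar differ by only about 0.04 and 0.05 between the two groups.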
One interesting thing I saw was the alcohol vs sugar plot. The volatility between both high and low quality wines is interesting. I was unsure whether this pattern was due to sugar not being a useful variable, or the fact that sugar was the most skewed of all observed variables.
This was the first instance found of a variable that affects quality. Alcohol is definitely one of the more useful variables in determining quality as well. The visual difference in the distribution makes this plot very easy to read and understand.
These plots showed how useful chloride count is as well in determining quality. Chloride count being correlated to Alcohol content also proved to be meaningful when going further in the analysis.
This final plot shows a definite difference in the values of chlorides and alcohol and their effect on quality. Throughout the analysis there were times when it seemed like high quality wines had more refined values in the observed variables. This plot is another good example of that.
I’m happy to have redone this project. I was happy with the work I did on the first submission but I was aware it needed some work. The feedback I received from the marker was very beneficial and helped to change some plots and analysis. I ended up changing much more (pretty much all of it) than I set out to do but I think there has been an overall improvement across all levels of analyses.
Getting through this project was a struggle to start but I’m happy to have finished. I tried exploring the data in a lot of different ways to start, really trying to find something unique. I got back on track once I started treating it more like a project to learn R with and to practice analyzing data. There are other relationships I would continue to look into if I were to continue with this project. I think I realized early in the project how many different ways there are to explore data, and I was a little discouraged. But going through the basics was enough to draw some general conclusions.
I really like how simple it is in R to explore so many different patterns. Once I get more familiar writing R and using some packages I’m excited to see how quickly I can find unique patterns in new data. Something I would have liked to explore more but didn’t was the high quality wines having strict ranges on several variables.